Goto

Collaborating Authors

 td method


The Pitfalls of Regularization in Off-Policy TD Learning

Neural Information Processing Systems

Temporal Difference (TD) learning is ubiquitous in reinforcement learning, where it is often combined with off-policy sampling and function approximation. Unfortunately learning with this combination (known as the deadly triad), exhibits instability and unbounded error. To account for this, modern Reinforcement Learning methods often implicitly (or sometimes explicitly) assume that regularization is sufficient to mitigate the problem in practice; indeed, the standard deadly triad examples from the literature can be ``fixed'' via proper regularization. In this paper, we introduce a series of new counterexamples to show that the instability and unbounded error of TD methods is not solved by regularization. We demonstrate that, in the off-policy setting with linear function approximation, TD methods can fail to learn a non-trivial value function under any amount of regularization; we further show that regularization can induce divergence under common conditions; and we show that one of the most promising methods to mitigate this divergence (Emphatic TD algorithms) may also diverge under regularization. We further demonstrate such divergence when using neural networks as function approximators. Thus, we argue that the role of regularization in TD methods needs to be reconsidered, given that it is insufficient to prevent divergence and may itself introduce instability. There needs to be much more care in the practical and theoretical application of regularization to Reinforcement Learning methods.


Is Temporal Difference Learning the Gold Standard for Stitching in RL?

arXiv.org Machine Learning

Reinforcement learning (RL) promises to solve long-horizon tasks even when training data contains only short fragments of the behaviors. This experience stitching capability is often viewed as the purview of temporal difference (TD) methods. However, outside of small tabular settings, trajectories never intersect, calling into question this conventional wisdom. Moreover, the common belief is that Monte Carlo (MC) methods should not be able to recombine experience, yet it remains unclear whether function approximation could result in a form of implicit stitching. The goal of this paper is to empirically study whether the conventional wisdom about stitching actually holds in settings where function approximation is used. We empirically demonstrate that Monte Carlo (MC) methods can also achieve experience stitching. While TD methods do achieve slightly stronger capabilities than MC methods (in line with conventional wisdom), that gap is significantly smaller than the gap between small and large neural networks (even on quite simple tasks). We find that increasing critic capacity effectively reduces the generalization gap for both the MC and TD methods. These results suggest that the traditional TD inductive bias for stitching may be less necessary in the era of large models for RL and, in some cases, may offer diminishing returns. Additionally, our results suggest that stitching, a form of generalization unique to the RL setting, might be achieved not through specialized algorithms (temporal difference learning) but rather through the same recipe that has provided generalization in other machine learning settings (via scale). Project website: https://michalbortkiewicz.github.io/golden-standard/


A Practical Introduction to Deep Reinforcement Learning

arXiv.org Artificial Intelligence

Abstract: Deep reinforcement learning (DRL) has emerged as a powerful framework for solving sequential decision-making problems, achieving remarkable success in a wide range of applications, including game AI, autonomous driving, biomedicine, and large language models. However, the diversity of algorithms and the complexity of theoretical foundations often pose significant challenges for beginners seeking to enter the field. This tutorial aims to provide a concise, intuitive, and practical introduction to DRL, with a particular focus on the Proximal Policy Optimization (PPO) algorithm, which is one of the most widely used and effective DRL methods. To facilitate learning, we organize all algorithms under the Generalized Policy Iteration (GPI) framework, offering readers a unified and systematic perspective. Instead of lengthy theoretical proofs, we emphasize intuitive explanations, illustrative examples, and practical engineering techniques. This work serves as an efficient and accessible guide, helping readers rapidly progress from basic concepts to the implementation of advanced DRL algorithms. Figure 1: Reinforcement learning has been extensively applied to a wide range of domains.


Reviews: Finite-Time Performance Bounds and Adaptive Learning Rate Selection for Two Time-Scale Reinforcement Learning

Neural Information Processing Systems

Here are few suggestions for expanding your presentation: 1. The literature review needs to be broadened. In particular, you should discuss the work of Liu et al., JAIR 2019 (Proximal Gradient TD Learning), which analyzes two time-scale algorithms that include a proximal gradient step. The results in that paper show improved finite sample bounds over classic gradient TD methods, like GTD2. How do your results compare with those in that paper, and in particular, can your analysis be extended to GTD2-MP (the mirror-prox variant of GTD2, which has an improved finite sample convergence rate compared to GTD2. 2. Two time scale algorithms are somewhat more complex than the standard TD method, and Sutton et al. and others have developed a variant of TD called emphatic TD (JMLR 2016: "An Emphatic Approach to the Problem of Off-policy Temporal-Difference Learning") that is stable under off-policy training.


The Pitfalls of Regularization in Off-Policy TD Learning

Neural Information Processing Systems

Temporal Difference (TD) learning is ubiquitous in reinforcement learning, where it is often combined with off-policy sampling and function approximation. Unfortunately learning with this combination (known as the deadly triad), exhibits instability and unbounded error. To account for this, modern Reinforcement Learning methods often implicitly (or sometimes explicitly) assume that regularization is sufficient to mitigate the problem in practice; indeed, the standard deadly triad examples from the literature can be fixed'' via proper regularization. In this paper, we introduce a series of new counterexamples to show that the instability and unbounded error of TD methods is not solved by regularization. We demonstrate that, in the off-policy setting with linear function approximation, TD methods can fail to learn a non-trivial value function under any amount of regularization; we further show that regularization can induce divergence under common conditions; and we show that one of the most promising methods to mitigate this divergence (Emphatic TD algorithms) may also diverge under regularization.


Demystifying the Recency Heuristic in Temporal-Difference Learning

arXiv.org Artificial Intelligence

The recency heuristic in reinforcement learning is the assumption that stimuli that occurred closer in time to an acquired reward should be more heavily reinforced. The recency heuristic is one of the key assumptions made by TD($\lambda$), which reinforces recent experiences according to an exponentially decaying weighting. In fact, all other widely used return estimators for TD learning, such as $n$-step returns, satisfy a weaker (i.e., non-monotonic) recency heuristic. Why is the recency heuristic effective for temporal credit assignment? What happens when credit is assigned in a way that violates this heuristic? In this paper, we analyze the specific mathematical implications of adopting the recency heuristic in TD learning. We prove that any return estimator satisfying this heuristic: 1) is guaranteed to converge to the correct value function, 2) has a relatively fast contraction rate, and 3) has a long window of effective credit assignment, yet bounded worst-case variance. We also give a counterexample where on-policy, tabular TD methods violating the recency heuristic diverge. Our results offer some of the first theoretical evidence that credit assignment based on the recency heuristic facilitates learning.


Can We Enhance the Quality of Mobile Crowdsensing Data Without Ground Truth?

arXiv.org Artificial Intelligence

Mobile crowdsensing (MCS) has emerged as a prominent trend across various domains. However, ensuring the quality of the sensing data submitted by mobile users (MUs) remains a complex and challenging problem. To address this challenge, an advanced method is required to detect low-quality sensing data and identify malicious MUs that may disrupt the normal operations of an MCS system. Therefore, this article proposes a prediction- and reputation-based truth discovery (PRBTD) framework, which can separate low-quality data from high-quality data in sensing tasks. First, we apply a correlation-focused spatial-temporal transformer network to predict the ground truth of the input sensing data. Then, we extract the sensing errors of the data as features based on the prediction results to calculate the implications among the data. Finally, we design a reputation-based truth discovery (TD) module for identifying low-quality data with their implications. Given sensing data submitted by MUs, PRBTD can eliminate the data with heavy noise and identify malicious MUs with high accuracy. Extensive experimental results demonstrate that PRBTD outperforms the existing methods in terms of identification accuracy and data quality enhancement.


Eigensubspace of Temporal-Difference Dynamics and How It Improves Value Approximation in Reinforcement Learning

arXiv.org Artificial Intelligence

We propose a novel value approximation method, namely "Eigensubspace Regularized Critic (ERC)" for deep reinforcement learning (RL). ERC is motivated by an analysis of the dynamics of Q-value approximation error in the Temporal-Difference (TD) method, which follows a path defined by the 1-eigensubspace of the transition kernel associated with the Markov Decision Process (MDP). It reveals a fundamental property of TD learning that has remained unused in previous deep RL approaches. In ERC, we propose a regularizer that guides the approximation error tending towards the 1-eigensubspace, resulting in a more efficient and stable path of value approximation. Moreover, we theoretically prove the convergence of the ERC method. Besides, theoretical analysis and experiments demonstrate that ERC effectively reduces the variance of value functions. Among 26 tasks in the DMControl benchmark, ERC outperforms state-of-the-art methods for 20. Besides, it shows significant advantages in Q-value approximation and variance reduction. Our code is available at https://sites.google.com/view/erc-ecml23/.


Temporal-Difference Networks

Neural Information Processing Systems

We introduce a generalization of temporal-difference (TD) learning to networks of interrelated predictions. Rather than relating a single pre- diction to itself at a later time, as in conventional TD methods, a TD network relates each prediction in a set of predictions to other predic- tions in the set at a later time. TD networks can represent and apply TD learning to a much wider class of predictions than has previously been possible. Using a random-walk example, we show that these networks can be used to learn to predict by a fixed interval, which is not possi- ble with conventional TD methods. Secondly, we show that if the inter- predictive relationships are made conditional on action, then the usual learning-efficiency advantage of TD methods over Monte Carlo (super- vised learning) methods becomes particularly pronounced.


Time Distance: A Novel Collision Prediction and Path Planning Method

arXiv.org Artificial Intelligence

In this paper, a new fast algorithm for path planning and a collision prediction framework for two dimensional dynamically changing environments are introduced. The method is called Time Distance (TD) and benefits from the space-time space idea. First, the TD concept is defined as the time interval that must be spent in order for an object to reach another object or a location. Next, TD functions are derived as a function of location, velocity and geometry of objects. To construct the configuration-time space, TD functions in conjunction with another function named "Z-Infinity" are exploited. Finally, an explicit formula for creating the length optimal collision free path is presented. Length optimization in this formula is achieved using a function named "Route Function" which minimizes a cost function. Performance of the path planning algorithm is evaluated in simulations. Comparisons indicate that the algorithm is fast enough and capable to generate length optimal paths as the most effective methods do. Finally, as another usage of the TD functions, a collision prediction framework is presented. This framework consists of an explicit function which is a function of TD functions and calculates the TD of the vehicle with respect to all objects of the environment.